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ABSTRACT 

Parallel matrix multiplication is one of the most studied fun- 
damental problems in distributed and high performance com- 
puting. We obtain a new parallel algorithm that is based on 
Strassen's fast matrix multiplication and minimizes communi- 
cation. The algorithm outperforms all known parallel matrix 
multiplication algorithms, classical and Strassen-based, both 
asymptotically and in practice. 

A critical bottleneck in parallelizing Strassen's algorithm 
is the communication between the processors. Ballard, Dem- 
mel, Holtz, and Schwartz (SPAA'll) prove lower bounds on 
these communication costs, using expansion properties of the 
underlying computation graph. Our algorithm matches these 
lower bounds, and so is communication-optimal. It exhibits 
perfect strong scaling within the maximum possible range. 

Benchmarking our implementation on a Cray XT4, we 
obtain speedups over classical and Strassen-based algorithms 
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ranging from 24% to 184% for a fixed matrix dimension 
n = 94080, where the number of nodes ranges from 49 to 
7203. 

Our parallelization approach generalizes to other fast ma- 
trix multiplication algorithms. 

Categories and Subject Descriptors: F.2.1 [Analysis of 
Algorithms and Problem Complexity]: Numerical Algorithms 
and Problems: Computations on matrices 
ACM General Terms: algorithms 

Keywords: parallel algorithms, communication-avoiding 
algorithms, fast matrix multiplication 



1. INTRODUCTION 

Matrix multiplication is one of the most fundamental al- 
gorithmic problems in numerical linear algebra, distributed 
computing, scientific computing, and high-performance com- 
puting. Parallelization of matrix multiplication has been 
extensively studied {e.g., [9] [6) [II] [I] [25] [20] [34] [To] [23] [30j [2] 
[3]). It has been addressed using many theoretical approaches, 
algorithmic tools, and software engineering methods in order 
to optimize performance and obtain faster and more efficient 
parallel algorithms and implementations. 

We obtain a new parallel algorithm based on Strassen's 
fast matrix multiplication^] It is more efficient than any 
other parallel matrix multiplication algorithm of which we 
are aware, including those that are based on classical (0(n 3 )) 
multiplication, and those that are based on Strassen's and 
other Strassen-like matrix multiplications. We compare the 
efficiency of the new algorithm with previous algorithms, 
and provide both asymptotic analysis (Sections [3] and [4| and 
benchmarking data (Section [5]. 

1.1 The communication bottleneck 

To design efficient parallel algorithms, it is necessary not 
only to load balance the computation, but also to minimize 
the time spent communicating between processors. The 
inter-processor communication costs are in many cases sig- 
nificantly higher than the computational costs. Moreover, 
hardware trends predict that more problems will become 
communication-bound in the future [l9 18 . Even matrix 
multiplication becomes communication-bound when run on 
sufficiently many processors. Given the importance of com- 
munication costs, it is preferable to match the performance 
of an algorithm to a communication lower bound, obtaining 
a communication-optimal algorithm. 

1.2 Communication costs of matrix multipli- 
cation 

We consider a distributed-memory parallel machine model 
as described in Section 12.11 The communication costs are 
measured as a function of the number of processors P, the 
local memory size M in words, and the matrix dimension 
n. Irony, Toledo, and Tiskin |23 proved that in the dis- 
tributed memory parallel model, the bandwidth cost of clas- 
sical n-by-n matrix multiplication is bounded by Q, ( pj ^i/ 2 j 
words. Using their technique one can also deduce a memory- 
independent bandwidth cost bound of fl ( p " /3 ^ 3 and gener- 
alize it to other classes of algorithms [5]. For a shared-memory 
model similar bounds were shown in [21 . Until recently, par- 
allel classical matrix multiplication algorithms (e.g., "2D" [9] 
[34], and "3D" [6[ "~ 
for specific M va 



1]) have minimized communication only 
ues. The first algorithm that minimizes 
the communication costs for the entire range of M has re- 
cently been obtained by Solomonik and Demmel [30] . See 
Section \4. II for more details. 

None of these lower bounding techniques and paralleliza- 
tion approaches generalize to fast matrix multiplication, such 
as[3^[2^[7l[2ll[2^[l31[Ml[T4l[T2l A communication 



cost lower bound for fast matrix multiplication algorithms 
has only recently been obtained [2[: Strassen's algorithm run 
on a distributed-memory parallel machine has bandwidth 



x Our actual implementation uses the Winograd variant 
see Appendix [A] for details. 



:!(> 



costQ^^)" • f) and latency cost ((-^7x72)"° ' p)> 
where ujo = log 2 7 (see Section [2~4| . These bounds generalize 
to other, but not all, fast matrix multiplication algorithms, 
with coo being the exponent of the computational complexity. 

In the sequential case[^]the lower bounds are attained by 
the natural recursive implementation [32] which is thus opti- 
mal. However, a parallel communication-optimal Strassen- 
based algorithm was not previously known. Previous parallel 
algorithms that use Strassen {e.g., 20, 25] [17]), decrease the 
computational costs at the expense of higher communication 
costs. The factors by which these algorithms exceed the 
lower bounds are typically small powers of P and M, as 
discussed in Section [4] However both P and M can be large 
{e.g. on a modern supercomputer, one may have P ~ 10 5 
and M ~ 10 9 ). 

1.3 Parallelizing Strassen's matrix multipli- 
cation in a communication efficient way 

The main impetus for this work was the observation of the 
asymptotic gap between the communication costs of existing 
parallel Strassen-based algorithms and the communication 
lower bounds. Because of the attainability of the lower 
bounds in the sequential case, we hypothesized that the gap 
could be closed by finding a new algorithm rather than by 
tightening the lower bounds. 

We made three observations from the lower bound results 
of E] that lead to the new algorithm. First, the lower bounds 
for Strassen are lower than those for classical matrix multi- 
plication. This implies that in order to obtain an optimal 
Strassen-based algorithm, the communication pattern for an 
optimal algorithm cannot be that of a classical algorithm but 
must reflect the properties of Strassen's algorithm. Second, 
the factor M w °/ 2_1 that appears in the denominator of the 
communication cost lower bound implies that an optimal al- 
gorithm must use as much local memory as possible. That is, 
there is a tradeoff between memory usage and communication 
(the same is true in the classical case). Third, the proof of the 
lower bounds shows that in order to minimize communication 
costs relative to computation, it is necessary to perform each 
sub-matrix multiplication of size 0(\/M) x Q{s/~M) on a 
single processor. 

With these observations and assisted by techniques from 
previous approaches to parallelizing Strassen, we developed 
a new parallel algorithm which achieves perfect load balance, 
minimizes communication costs, and in particular performs 
asymptotically less computation and communication than is 
possible using classical matrix multiplication. 

1.4 Our contributions and paper organization 

Our main contribution is a new algorithm we call 
Communication- Avoiding Parallel Strassen, or CAPS. 

Theorem 1.1. CAPS asymptotically minimizes computa- 
tional and bandwidth costs over all parallel Strassen-based 
algorithms. It also minimizes latency cost up to a logarithmic 
factor in the number of processors. 

CAPS performs asymptotically better than any previous 
previous classical or Strassen-based parallel algorithm. It also 
runs faster in practice. The algorithm and its computational 
and communication cost analyses are presented in Section [3] 
There we show it matches the communication lower bounds. 

2 See [i] for a discussion of the sequential memory model. 



We provide a review and analysis of previous algorithms 
in Section U We also consider two natural combinations of 
previously known algorithms (Sections |4.4| and |4.5| |. One of 
these new algorithms that we call "2.5D-Strassen" performs 
better than all previous algorithms, but is still not optimal, 
and performs worse than CAPS. 

We discuss our implementations of the new algorithms and 
compare their performance with previous ones in Section [5] 
to show that our new CAPS algorithm outperforms previ- 
ous algorithms not just asymptotically, but also in practice. 
Benchmarking our implementation on a Cray XT4, we ob- 
tain speedups over classical and Strassen-based algorithms 
ranging from 24% to 184% for a fixed matrix dimension 
n = 94080, where the number of nodes ranges from 49 to 
7203. 

In Section [6] we show that our parallelization method 
applies to other fast matrix multiplication algorithms. It 
also applies to classical recursive matrix multiplication, thus 
obtaining a new optimal classical algorithm that matches the 
2.5D algorithm of Solomonik and Demmel [30]. In Section [6] 
we also discuss numerical stability, hardware scaling, and 
future work. 

2. PRELIMINARIES 

2.1 Communication model 

We model communication of distributed-memory parallel 
architectures as follows. We assume the machine has P 
processors, each with local memory of size M words, which 
are connected via a network. Processors communicate via 
messages, and we assume that a message of w words can 
be communicated in time a + /3w. The bandwidth cost of 
the algorithm is given by the word count and denoted by 
BW(-), and the latency cost is given by the message count 
and denoted by L(-). Similarly the computational cost is 
given by the number of floating point operations and denoted 
by F(-). We call the time per floating point operation 7. 

We count the number of words, messages and floating point 
operations along the critical path as defined in [37]. That 
is, two messages that are communicated between separate 
pairs of processors simultaneously are counted only once, as 
are two floating point operations performed in parallel on 
different processors. This metric is closely related to the 
total running time of the algorithm, which we model as 

aL(-)+PBW(-)+jF(-). 

We assume that (1) the architecture is homogeneous (that 
is, 7 is the same on all processors and a and are the 
same between each pair of processors), (2) processors can 
send/receive only one message to/from one processor at a 
time and they cannot overlap computation with communi- 
cation (this latter assumption can be dropped, affecting the 
running time by a factor of at most two), and (3) there is no 
communication resource contention among processors. That 
is, we assume that there is a link in the network between 
each pair of processors. Thus lower bounds derived in this 
model are valid for any network, but attainability of the 
lower bounds depends on the details of the network. 

2.2 Strassen's algorithm 

Strassen showed that 2x2 matrix multiplication can be 
performed using 7 multiplications and 18 additions, instead 
of the classical algorithm that does 8 multiplications and 



4 additions [32]. By recursive application this yields an 
algorithm with multiplies two n x n matrices 0(n" ) flops, 
where a>o = log 2 7 ~ 2.81. Winograd improved the algorithm 
to use 7 multiplications and 15 additions in the base case, 
thus decreasing the hidden constant in the O notation 
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We review the Strassen- Winograd algorithm in Appendix | A| 

2.3 Previous work on parallel Strassen 

In this section we breifly describe previous efforts to par- 
allelize Strassen. More details, including communication 
analyses, are in Section [4] A summary appears in Table [l] 

Luo and Drake [25] explored Strassen-based parallel al- 
gorithms that use the communication patterns known for 
classical matrix multiplication. They considered using a clas- 
sical 2D parallel algorithm and using Strassen locally, which 
corresponds to what we call the "2D-Strassen" approach (see 
Section [4~2| . They also consider using Strassen at the highest 
level and performing a classical parallel algorithm for each 
subproblem generated, which corresponds to what we call the 
"Strassen-2D" approach. The size of the subproblems depends 
on the number of Strassen steps taken (see Section [4~3| . Luo 
and Drake also analyzed the communication costs for these 
two approaches. 

Soon after, Grayson, Shah, and van de Geijn [20] improved 
on the Strassen-2D approach of [25] by using a better classical 
parallel matrix multiplication algorithm and running on a 
more communication-efficient machine. They obtained better 
performance results compared to a purely classical algorithm 
for up to three levels of Strassen's recursion. 

Kumar, Huang, Johnson, and Sadayappan [24| imple- 
mented Strassen's algorithm on a shared-memory machine. 
They identified the tradeoff between available parallelism and 
total memory footprint by differentiating between "partial" 
and "complete" evaluation of the algorithm, which corre- 
sponds to what we call depth-first and breadth-first traversal 
of the recursion tree (see Section |3.1[ ). They show that by 
using £ DFS steps before using BFS steps, the memory foot- 
print is reduced by a factor of (7/4) f compared to using all 
BFS steps. They did not consider communication costs in 
their work. 

Other parallel approaches [17[ |22[ |3"T] have used more com- 
plex parallel schemes and communication patterns. However, 
they restrict attention to only one or two steps of Strassen 
and obtain modest performance improvements over classical 
algorithms. 

2.4 Strassen lower bounds 

For Strassen-based algorithms, the bandwidth cost lower 
bound has been proved using expansion arguments on the 
computation graph, and the latency cost lower bound is an 
immediate corollary. 

Theorem 2.1. (Memory-dependent lower bound) Con- 
sider a Strassen-based algorithm running on P processors 
each with local memory size M. Let BW(n, P, M) be the 
bandwith cost and L(n, P, M) be the latency cost of the al- 
gorithm. Assume that no intermediate values are computed 
twice. Then 



BW(n, P, M) = Q 



L(n, P, M) = Q 



M 
~P 



n 
M 



A memory-independent lower bound has recently been 
proved using the same expansion approach: 

Theorem 2.2. (Memory-independent lower bound) 
Consider a Strassen-based algorithm running on P processors. 
Let BW(n, P) be the bandwith cost and L(n, P) be the latency 
cost of the algorithm. Assume that no intermediate values 
are computed twice. Assume only one copy of the input data 
is stored at the start of the algorithm and the computation is 
load-balanced in an asymptotic sense. Then 



BW{n,P) = n 



p2/u 



and the latency cost is L(n,P) — fi(l). 

Note that when M = 0(n 2 /P 2/u °), the memory- 
dependent lower bound is dominant, and when M — 
r2(n 2 /P 2 ^°), the memory-independent lower bound is domi- 
nant. 



3. COMMUNICATION-AVOIDING 
PARALLEL STRASSEN 

In this section we present the CAPS algorithm, and prove 
it is communication-optimal. See Algorithm [l] for a concise 
presentation and Algorithm [2] for a more detailed description. 

Theorem 3.1. CAPS has computational cost 9 ( z i^) > 
bandwidth cost O (max | pM " ° 2 -i > P "/^ jj > an d latency 
cost 9 (max { log P, log p}) . 

By Theorems |2.1| and |2.2[ we see that CAPS has optimal 
computational and bandwidth costs, and that its latency 
cost is at most logP away from optimal. Thus Theorem |1.1| 
follows. We prove Theorem |3.1| in Section [3. 5| 



3.1 Overview of CAPS 

Consider the recursion tree of Strassen's sequential algo- 
rithm. CAPS traverses it in parallel as follows. At each level 
of the tree, the algorithm proceeds in one of two ways. A 
"breadth- first-step" (BFS) divides the 7 subproblems among 
the processors, so that = of the processors work on each sub- 
problem independently and in parallel. A "depth-first-step" 
(DFS) uses all the processors on each subproblem, solving 
each one in sequence. See Figure [I] 

In short, a BFS step requires more memory but reduces 
communication costs while a DFS step requires little extra 
memory but is less communication-efficient. In order to 
minimize communication costs, the algorithm must choose 
an ordering of BFS and DFS steps that uses as much memory 
as possible. 

Let k = log 7 P and s > k be the number of distributed 
Strassen steps the algorithm will take. In this section, we 
assume that n is a multiple of 2"7^ k ^ 2 ^ . If k is even, the 
restriction simplifies to n being a multiple of 2"\fP. Since 
P is a power of 7, it is sometimes convenient to think of 
the processors as numbered in base 7. CAPS performs s 
steps of Strassen's algorithm and finishes the calculation 
with local matrix multiplication. The algorithm can easily 
be generalized to other values of n by padding or dynamic 
peeling. 





Figure 1: Representation of BFS and DFS steps. In 
a BFS step, all seven subproblems are computed at 
once, each on 1/7 of the processors. In a DFS step, 
the seven subproblems are computed in sequence, 
each using all the processors. The notation follows 
that of Appendix \A\ 



We consider two simple schemes of traversing the recursion 
tree with BFS and DFS steps. The first scheme, which we 
call the Unlimited Memory (UM) scheme, is to take k BFS 
steps in a row. This approach is possible only if there is 
sufficient available memory. The second scheme, which we 
call the Limited Memory (LM) scheme is to take i DFS 
steps in a row followed by k BFS steps in a row, where I is 
minimized subject to the memory constraints. 

It is possible to use a more complicated scheme that in- 
terleave BFS and DFS steps to reduce communication. We 
show that the LM scheme is optimal up to a constant factor, 
and hence no more than a constant factor improvement can 
be attained from interleaving. 



Algorithm 1 CAPS, in brief. 
rithmH 



For more details, see Algo- 



Input: A, B, n, where A and B are n x n matrices 

P — number of processors 
Output: C = A ■ B 

1: procedure C = CAPS(A, B, n, P) 

2: if enough memory then > Do a BFS step 

3: locally compute the Si's and T/s from A and B 

4: parallel for i = 1 ... 7 do 

5: redistribute Si and Ti 

6: Qi = CAPS(Si, T h n/2, P/7) 

7: redistribute Qi 

8: locally compute C from all the QiS 

9: else > Do a DFS step 

10: for i = 1 ... 7 do 

11: locally compute Si and Ti from A and B 

12: Qi = CAPS(S 8 , %, n/2, P) 

13: locally compute contribution of Qi to C 

> The dependence of the Si's on A, the TVs on B and C 
on the Qi's follows the Strassen or Strassen- Winograd 
algorithm. See Appendix [A] 

3.2 Data layout 

We require that the data layout of the matrices satisfies 
the following two properties: 

1. At each of the s Strassen recursion steps, the data 
layouts of the four sub-matrices of each of A, B, and 
C must match so that the weighted additions of these 
sub-matrices can be performed locally. This technique 
follows [25] and allows communication-free DFS steps. 



2. Each of these submatrices must be equally distributed 
among the P processors for load balancing. 

There are many data layouts that satisfy these properties, per- 
haps the simplest being block-cyclic layout with a processor 
grid of size 7 Lfe/2J x 7 rfc/21 and block size 2s7 " k /2J x - " fc /2 l . 
(When fc = log 7 P is even these expressions simplify to a 
processor grid of size yP x vP and block size 23 r \p ■) See 
Figure [2] 

Any layout that we use is specified by three parameters, 
(n,P,s), and intermediate stages of the computation use the 
same layout with smaller values of the parameters. A BFS 
step reduces a multiplication problem with layout parame- 
ters (n, P, s) to seven subproblems with layout parameters 
(n/2, P/7, s — 1). A DFS step reduces a multiplication prob- 
lem with layout parameters (n, P, s) to seven subproblems 
with layout parameters (n/2, P,s — 1). 

Note that if the input data is initially load-balanced but 
distributed using a different layout, we can rearrange it to 
the above layout using a total of O (jjr^ words and 0(n 2 ) 
messages. This has no asymptotic effect on the bandwidth 
cost but significantly increases the latency cost in the worst 



Algorithm 2 CAPS, in detail 



Input: A, B, axe n x n matrices 

P — number of processors 

rank = processor number base-7 as an array 

AI — local memory size 
Output: C = A ■ B 
1: procedure C = CAPS(A, B, P, rank, M) 

2: e=\lo g2 pl/ 4" Ml/2 ] 

> £ is number of DFS steps to fit in memory 
3: k = log 7 P 

4: call DFS(A, B, C, fc, I, rank) 

1: procedure DFS(A, B, C, fc, I, rank) 

> Do C = A ■ B by £ DFS, then fc BFS steps 
2: if £ < then call BFS( A, B, C, fc, rank); return 
3: for i = 1 . . . 7 do 

4: locally compute Si and Ti from A and B 

> following Strassen's algorithm 
5: call DFS( Si, Ti, Qi, k,£~l, rank ) 

6: locally compute contribution of Qi to C 

> following Strassen's algorithm 
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Figure 2: An example matrix layout for CAPS. Each 
of the 16 submatrices as shown on the left has ex- 
actly the same layout. The colored blocks are the 
ones owned by processor 00. On the right is a 
zoomed-in view of one submatrix, showing which 
processor, numbered base 7, owns each block. This 
is block-cyclic layout with some blocksize b, and 
matches our layout requirements with parameters 
(n = 4- 49- b,P = 49, s = 2). 



3.3 Unlimited Memory scheme 

In the UM scheme, we take fc = log 7 P BFS steps in a 
row. Since a BFS step reduces the number of processors 
involved in each subproblem by a factor of 7, after fc BFS 
steps each subproblem is assigned to a single processor, and 
so is computed locally with no further communication costs. 
We first describe a BFS step in more detail. 

The matrices A and B are initially distributed as described 
in Section [3~2] In order to take a recursive step, the 14 matri- 
ces Si, . . . St, Ti, . . . , Tt must be computed. Each processor 
allocates space for all 14 matrices and performs local addi- 
tions and subtractions to compute its portion of the matrices. 
Recall that the submatrices are distributed identically, so 
this step requires no communication. If the layouts of A 
and B have parameters (n, P, s), the Si and the Ti now have 
layout parameters (n/2, P,s — 1). 



1: procedure BFS(A, B, C, fc, rank) 

t> Do C = A ■ B by k BFS steps, then local Strassen 
2: if fc == then call localStrassen(A, B, C); return 
3: for i = 1 . . . 7 do 

4: locally compute Si and Ti from A and B 

> following Strassen's algorithm 

for £ = 1 ... 7 do 

target = rank 
target [fc] = i 
send Si to target 
receive into L 

> One part of L comes from each of 7 processors 
10: send Ti to target 

11: receive into R 

> One part of R comes from each of 7 processors 
12: call BFS(L, R, P, k — 1, rank ) 

13: for i = 1 ... 7 do 

14: target = rank 

15: target [fc] = i 

16: send i th part of P to target 

17: receive from target into Qi 

18: locally compute C from Qi 

> following Strassen's algorithm 



The next step is to redistribute these 14 matrices so that 
the 7 pairs of matrices (Si,Ti) exist on disjoint sets of P/7 
processors. This requires disjoint sets of 7 processors perform- 
ing an all-to-all communication step (each processor must 
send and receieve a message from each of the other 6). To 
see this, consider the numbering of the processors base-7. On 
the m th BFS step, the communication is between the seven 
processors whose numbers agree on all digits except the m th 
(counting from the right). After the m th BFS step, the set 
of processors working on a given subproblem share the same 
m-digit suffix. After the above communication is performed, 
the layout of Si and Ti has parameters (n/2, P/7, s — 1), and 
the sets of processors that own the Ti and Si are disjoint 



for different values of i. Note that since each all-to-all only 
involves seven processors no matter how large P is, this 
algorithm does not have the scalability issues that typically 
come from an all-to-all communication pattern. 

3.3.1 Memory requirements 

The extra memory required to take one BFS step is the 
space to store all 7 triples Sj, Tj, Qj. Since each of those 
matrices is \ the size of A, B, and C, the extra space 
required at a given step is 7/4 the extra space required for 
the previous step. We assume that no extra memory is 
required for the local multiplications]^] Thus, the total local 
memory requirement for taking k BFS steps is given by 



Mem UM (n, P) 



3n 
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e 
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3.3.2 Computational costs 

The computation required at a given BFS step is that of the 
local additions and subtractions associated with computing 
the Si and Ti and updating the output matrix C with the 
Qi. Since Strassen performs 18 additions and subtractions, 
the computational cost recurrence is 



F VM (n,P) = 18 



n 

4P 



FxjM 



n P 
2' 7 



with base case Pum(^, 1) = c s n"° — 6n 2 , where c s is the 
constant of Strassen's algorithm. See Appendix |A"] for more 
details. The solution to this recurrence is 



Fvm (n,P) = 



6n 2 



= e 



3.3.3 Communication costs 

Consider the communication costs associated with the 
UM scheme. Given that the redistribution within a BFS 
step is performed by an all-to-all communication step among 
sets of 7 processors, each processor sends 6 messages and 
receives 6 messages to redistribute Si, . . . , SV, and the same 
for Ti, . . . , TV. After the products Qi = SiT are computed, 
each processor sends 6 messages and receive 6 messages to 
redistribute Qi, . . . ,Q?. The size of each message varies 
according to the recursion depth, and is the number of words 

2 

a processor owns of any Si, T, or Qi, namely words. 

As each of the Qi is computed simultaneously on disjoint 
sets of P/7 processors, we obtain a cost recurrence for the 
entire UM scheme: 

n 2 ( n P N 

BWvM{n,P) = 36—+BW V M 



4P 



L VM (n, P) = 36 + L\_ 



2' 7 



2' 7 



with base case Lum(«, 1) = BW\jM(n, 1) = 0. Thus 



BWvM(n,P) = 



Yin 2 



Yin 2 



= e 



p2/u P V p2/u, 

iuM(n,P) = 361og 7 P = e(logP). (1) 



3 If one does not overwrite the input, it is impossible to run 
Strassen in place; however using a few temporary matrices 
affects the analysis here by a constant factor only. 



3.4 Limited Memory scheme 

In this section we discuss a scheme for traversing Strassen's 
recursion tree in the context of limited memory. In the LM 
scheme, we take £ DFS steps in a row followed by k BFS 
steps in a row, where £ is minimized subject to the memory 
constraints. That is, we use a sequence of DFS steps to 
reduce the problem size so that we can use the UM scheme 
on each subproblem without exceeding the available memory. 

Consider taking a single DFS step. Rather than allocating 
space for and computing all 14 matrices Si, Ji, . . . ,St , T7 at 
once, the DFS step requires allocation of only one subproblem, 
and each of the Qi will be computed in sequence. 

Consider the i th subproblem: as before, both Si and Ti 
can be computed locally. After Qi is computed, it is used to 
update the corresponding quadrants of C and then discarded 
so that its space in memory (as well as the space for Si and 
Ti) can be re-used for the next subproblem. In a DFS step, 
no redistribution occurs. After Si and T are computed, all 
processors participate in the computation of Qi. 

We assume that some extra memory is available. To be 
precise, assume the matrices A, B, and C require only ~ of 
the available memory: 

3 - 2 ' 1 - (2) 



In the LM scheme 
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The following subsection shows that this choice of £ is 
sufficient not to exceed the memory capacity. 

3.4.1 Memory requirements 

The extra memory requirement for a DFS step is the space 
to store one subproblem. Thus, the extra space required at 
this step is 1/4 the space required to store A, B, and C. The 
local memory requirements for the LM scheme is given by 
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where the last line follows from Q and (J2j) . Thus, the limited 
memory scheme does not exceed the available memory. 

3.4.2 Computational costs 

As in the UM case, the computation required at a given 
DFS step is that of the local additions and subtractions 
associated with computing the Si and T and updating the 
output matrix C with the Qi. However, since all processors 
participate in each subproblem and the subproblems are 
computed in sequence, the recurrence is given by 



Plm (n,P) = 18 



4P 



7 • Ft. 
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After £ steps of DFS, the size of a subproblems is |y x |y, 
and there are P processors involved. We take k BFS steps 
to compute each of these 7 e subproblems. Thus 

Plm(J,p) =F UM (£,P), 



and 



P "\P 



3.4.5 Communication costs 

Since there are no communication costs associated with a 
DFS step, the recurrence is simply 

BWlm (n, P) = 7 • BWlm (», p) 

LhM(n, P) = 7 ■ Z/lm (~ p) 

with base cases 

£W L m(J,p) =BWum(J,p) 

L LM (|,P)=L UM g I P). 
Thus the total communication costs are given by 
BW LM (n,P) = 7 l - BWvm ( J , p) 
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3.5 Communication optimality 



PROOF, (o/ Theorem \3.1\ . In the case that M > 
MemuM(n,P) = H f P 2/^„ ) the UM scheme is possible. 
Then the communication costs are given by |TJ which matches 
the lower bound of Theorem 12.21 Thus the UM scheme is 
communication-optimal (up to a logarithmic factor in the la- 
tency cost and assuming that the data is initially distributed 
as described in Section 3.2|. For smaller values of M, the 



LM scheme must be used. Then the communication costs 



2.1 



are given by Q and match the lower bound of Theorem 
so the LM scheme is also communication-optimal. □ 



We note that for the LM scheme, since both the compu- 
tational and communication costs are proportional to -p, 
we can expect perfect strong scaling: given a fixed problem 
size, increasing the number of processors by some factor 
will decrease each cost by the same factor. However, this 
strong scaling property has a limited range. As P increases, 
holding everything else constant, the global memory size 
PM increases as well. The limit of perfect strong scaling is 
exactly when there is enough memory for the UM scheme. 
See [3] for details. 

4. ANALYSIS OF OTHER ALGORITHMS 

In the section we detail the asymptotic communication 
costs of other matrix multiplication algorithms, both classical 



and Strassen-based. These communication costs and the 
corresponding lower bounds are summarized in Table [T] 

Many of the algorithms described in this section are hybrids 
of two different algorithms. We use the convention that the 
names of the hybrid algorithms are composed of the names of 
the two component algorithms, hyphenated. The first name 
describes the algorithm used at the top level, on the largest 
problems, and the second describes the algorithm used at 
the base level on smaller problems. 

4.1 Classical Algorithms 

Classical algorithms must communicate asymptotically 
more than an optimal Strassen-based algorithm. To compare 
the lower bounds, it is necessary to consider three cases 
for the memory size: when the memory-dependent bounds 
dominate for both classical and Strassen, when the memory- 
dependent bound dominates for classical, but the memory- 
independent bound dominates for Strassen, and when the 
memory-independent bounds dominate for both classical and 
Strassen. This analysis is detailed in Appendix [B] Briefly, 
the factor by which the classical bandwidth cost exceeds 
the Strassen bandwidth cost is P a where a ranges from 



0.046 to 



0.10 depending on the relative 



_2 

problem size. The same sort of analysis is used throughout 
Section [1] to compare each algorithm with the Strassen-based 
lower bounds. 

Various parallel classical matrix multiplication algorithms 
minimize communication relative to the classical lower 
bounds for certain amounts of local memory M. For ex- 
ample, Cannon's algorithm M minimizes communication for 
M = 0(n 2 /P). Several more practical algorithms exist (such 
as SUMMA [34]) which use the same amount of local memory 
and have the same asymptotic communication costs. We 
call this class of algorithms "2D" because the communication 
patterns follow a two-dimensional processor grid. 

Another class of algorithms, known as "3D" [s| [TJ because 
the communication pattern maps to a three-dimensional 
processor grid, uses more local memory and reduces com- 
munication relative to 2D algorithms. This class of algo- 
rithms minimizes communication relative to the classical 
lower bounds for M — Q.(n 2 /P 2 / 3 ). As shown in [3], it is 
not possible to use more memory than M = 0(n 2 /P^ 3 ) to 
reduce communication. 

Recently, a more general algorithm has been developed 
which minimizes communication in all cases. Because it 
reduces to a 2D and 3D for the extreme values of M but 
interpolates for the values between, it is known as the "2.5D" 
algorithm [30] . 

4.2 2D-Strassen 

One idea to parallelize Strassen-based algorithms is to 
use a 2D classical algorithm for the inter-processor commu- 
nication, and use the fast matrix multiplication algorithm 
locally 25 . We call such an algorithm "2D-Strassen". It 
is straightforward to implement, but cannot attain all the 
computational speedup from Strassen since it uses a classi- 
cal algorithm for part of the computation. In particular, it 
does not use Strassen for the largest matrices, when Strassen 
provides the greatest reduction in computation. As a re- 
sult, the computational cost exceeds B(n w °/P) by a factor 
of p( 3 -"o)/2 w po.10 The 2D-Strassen algorithm has the 
same communication cost as 2D algorithms, and hence does 
not match the communication costs of CAPS. In comparing 
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Table 1: Asymptotic matrix multiplication computational and communication costs of algorithms and corre- 
sponding lower bounds. Here uiq = log 2 7 ~ 2.81 is the exponent of Strassen; I is the number of Strassen steps 
taken. None of the Strassen-based algorithms except for CAPS attain the lower bounds of Section |2.4[ see 
Section [4] for a discussion of each. 



the 2D-Strassen bandwidth cost, Q(n 2 /P 1/2 ), to the CAPS 
bandwidth cost in Section [3] note that for the problem to fit 
in memory we always have M = Q(n 2 /P). The bandwidth 
cost exceeds that of CAPS by a factor of P a , where a ranges 
from (3 — uio)/2 « .10 to 2/cjo — 1/2 ~ .21, depending on the 
relative problem size. Similarly, the latency cost, Q(P 1 ^ 2 ), 
exceeds that of CAPS by a factor of P a where a ranges from 
(3 - w )/2 w .10 to 1/2 = .5. 

4.3 Strassen-2D 

The "Strassen-2D" algorithm applies I DFS steps of 
Strassen's algorithm at the top level, and performs the 7 
smaller matrix multiplications using a 2D algorithm. By 
choosing certain data layouts as in Section [3. 2| it is possible 
to do the additions and subtractions for Strassen's algorithm 
without any communication [25]. However, Strassen-2D is 
also unable to match the communication costs of CAPS. 
Moreover, the speedup of Strassen-2D in computation comes 
at the expense of extra communication. For large numbers 
of Strassen steps £, Strassen-2D can approach the compu- 
tational lower bound of Strassen, but each step increases 
the bandwidth cost by a factor of | and the latency cost 
by a factor of 7. Thus the bandwidth cost of Strassen-2D is 

/ 7 \ £ 

a factor of (^J higher than 2D-Strassen, which is already 
higher than that of CAPS. The latency cost is even worse: 
Strassen-2D is a factor of 7 higher than 2D-Strassen. 

One can reduce the latency cost of Strassen-2D at the 
expense of a larger memory footprint. Since Strassen-2D 
runs a 2D algorithm 7 e times on the same set of processors, 
it is possible to pack together messages from independent 
matrix multiplications. In the best case, the latency cost 
is reduced to the cost of 2D-Strassen, which is still above 
that of CAPS, at the expense of using a factor of more 
memory. 

4.4 2.5D-Strassen 



A natural idea is to replace a 2D classical algorithm in 
2D-Strassen with the superior 2.5D classical algorithm to 
obtain an algorithm we call 2.5D-Strassen. This algorithm 
uses the 2.5D algorithm for the inter-processor communi- 
cation, and then uses Strassen for the local computation. 
When M = Q(n 2 /P), 2.5D-Strassen is exactly the same 
as 2D-Strassen, but when there is extra memory it both 
decreases the communication cost and decreases the com- 
putational cost since the local matrix multiplications are 
performed (using Strassen) on larger matrices. To be precise, 
the computational cost exceeds the lower bound by a factor 
of P a where a ranges from 1 - f - w 0.064 to ^fl « 0.10 
depending on the relative problem size. The bandwidth cost 
exceeds the bandwidth cost of CAPS by a factor of P a where 
a ranges from - -fas 0.046 to ^» ^ 0.10. In terms of 

latency, the cost of p ^ l3 / 2 + l°g P exceeds the latency cost of 
CAPS by a factor ranging from logP to p( 3 -"o)/ 2 ~ pO iO ) 
depending on the relative problem size. 



4.5 Strassen-2.5D 

Similarly, by replacing a 2D algorithm with 2.5D in 
Strassen-2D, one obtains the new algorithm we call Strassen- 
2.5D. First one takes I DFS steps of Strassen, which can be 
done without communication, and then one applies the 2.5D 
algorithm to each of the 7 subproblems. The computational 
cost is exactly the same as Strassen-2D, but the communica- 
tion cost will typically be lower. Each of the 7 e subproblems 
is multiplication of n/2 e x n/2 l matrices. Each subproblem 
uses only l/4 f as much memory as the original problem. 
Thus there may be a large amount of extra memory available 
for each subproblem, and the lower communication costs of 
the 2.5D algorithm help. The choice of I that minimizes the 



bandwidth cost is 



£ opt = max {O, [log 2 M1/ " pl/3 ] } • 
The same choice minimizes the latency cost. Note that 

2 

when M > " /3 , taking zero Strassen steps minimizes the 
communication within the constraints of the Strassen-2.5D 



£ op t, the bandwidth cost is a fac- 



algorithm. With £ 

tor of P 1 -^/ 3 fa P 064 above that of CAPS. Additionally, 
the computational cost is not optimal, and using I = £ op t, 
the computational cost exceeds the optimal by a factor of 

pl-t*Jo/3^3/2-^o/2 w pO. 064^-0. 096 

It is also possible to take £ > £ op t steps of Strassen to 
decrease the comptutational cost further. However the de- 
creased computational cost comes at the expense of higher 
communication cost, as in the case of Strassen-2D. In partic- 
ular, each extra step over £ opt increases the bandwidth cost 
by a factor of | and the latency cost by a factor of 7. As 
with Strassen-2D, it is possible to use extra memory to pack 
together messages from several subproblems and decrease 
the latency cost, but not the bandwidth cost. 



5. PERFORMANCE RESULTS 

We have implemented CAPS using MPI on a Cray XT4, 
and compared it to various previous classical and Strassen- 
based algorithms. The benchmarking data is shown in Fig- 
ure [3] 

5.1 Experimental setup 

The nodes of the Cray XT4 have 8GB of memory and a 
quad-core AMD "Bupdapest" processor running at 2.3GHz. 
We treat the entire node as a single processor, and when we 
use the classical algorithm we call the optimized threaded 
BLAS in Cray's LibSci to provide parallelism between the 
four cores in a node. The peak flop rate is 9.2 GFLOPS per 
core, or 36.8 GFLOPS per node. The machine consists of 
9,572 nodes. All the data in Figure [3] is for multiplying two 
square matrices with n — 94080. 

5.2 Performance 

Note that the vertical scale of Figure [3] is "effective 
GFLOPS", which is a useful measure for comparing classical 
and fast matrix multiplication algorithms. It is calculated as 



Effective GFLOPS = 



2n A 



(Execution time in seconds) 10 9 ' 



(5) 

For classical algorithms, which perform 2n 3 floating point 
operations, this gives the actual GFLOPS. For fast matrix 
multiplication algorithms it gives the performance relative 
to classical algorithms, but does not accurately represent the 
number of floating point operations performed. 

Our algorithm outperforms all previous algorithms, and 
attains performance as high as 49.1 effective GFLOPS/node, 
which is 33% above the theoretical maximum for all classical 
algorithms. Compared with the best classical implementa- 
tion, our speedup ranges from 51% for small values of P up 
to 94% when using most of the machine. Compared with 
the best previous parallel Strassen algorithms, our speedup 
ranges from 24% up to 184%. Unlike previous Strassen algo- 
rithms, we are able to attain substantial speedups over the 
entire range of processors. 



5.3 Strong scaling 

Figure [3] is a strong scaling plot: the problem size is fixed 
and each algorithm is run with P ranging from the minimum 
that provides enough memory up to the largest allowed value 
of P smaller than the size of the machine. Perfect strong 
scaling corresponds to a horizontal line in the plot. As 
the communication analysis predicts, CAPS exhibits better 
strong scaling than any of the other algorithms (with the 
exception of ScaLAPACK, which obtains very good strong 
scaling by having poor performance for small values of P). 

5.4 Details of the implementations 

5.4.1 CAPS 

This implementation is the CAPS algorithm, with a few 
modifications from the presentation in Section[3] First, when 
computing locally it switches to classical matrix multiplica- 
tion below some size no. Second, it is generalized to run 
on P = c7 fc processors for c £ {1,2,3,6} rather than just 
7 k processors. As a result, the base-case classical matrix 
multiplication is done on c processors rather than 1. Finally, 
implementation uses the Winograd variant of Strassen; see 
Appendix [A] for more details. Every point in the plot is 
tuned to use the best interleaving pattern of BFS and DFS 
steps, and the best total number of Strassen steps. For points 
in the figure, the optimal total number of Strassen steps is 
always 5 or 6. 

5.4.2 ScaLAPACK 

We use ScaLAPACK [8] as optimized by Cray in LibSci. 
This is an implementation of the SUMMA algorithm, and 
can run on an arbitrary number of processors. It should 
give the best performance if P is a perfect square so the 
processors can be placed on a square 2D grid. All the runs 
shown in Figure [3] are with P a perfect square. 

5.4.3 2.5D classical 

This is the code of [30] . It places the P processors in 
a grid of size \J P/c x \J P/c x c, and requires that \fP~Jc. 
and c are integers with 1 < c < P 1 ^ 3 , and c divides y/ P/c. 
Additionally, it gets the best performance if c is as large as 
possible, given the constraint that c copies of the input and 
output matrices fit in memory. In the case that c = 1 this 
code is an optimized implementation of SUMMA. The values 
of P and c for the runs in Figure [3] are chosen to get the best 
performance. 

5.4.4 Strassen-2D 

Following the algorithm of [20| |25| , this implementation 
uses the DFS code from the implementation of CAPS at the 
top level, and then uses the optimized SUMMA code from 
the 2.5D implementation with c = 1. Since the classical code 
requires that P is a perfect square, this requirement applies 
here as well. The number of Strassen steps taken is tuned to 
give the best performance for each P value, and the optimal 
number varies from to 2. 

5.4.5 2D-Strassen 

Following the algorithm of [25], the 2D-Strassen implemen- 
tation is analagous to the Strassen-2D implementation, but 
with the classical algorithm run before taking local Strassen 
steps. Similarly the same code is used for local Strassen steps 
here and in our implementation of CAPS. This code also 




Figure 3: Strong scaling performance of various matrix multiplication algorithms on Cray XT4 for fixed 
problem size n = 94080. The top line is CAPS as described in Section |3j and substantially outperforms all the 
other classical and Strassen-based algorithms. The horizontal axis is the number of nodes in log-scale. The 
vertical axis is effective GFLOPS, which are a performance measure rather than a flop rate, as discussed in 
Section |5.2[ See Section |5.4| for a description of each implementation. 



requires that P is a perfect square. The number of Strassen 
steps is tuned for each P value, and the optimal number 
varies from to 3. 

5.4.6 2. 5D -Strassen 

This implementation uses the 2.5D implementation to 
reduce the problem to one processor, then takes several 
Strassen steps. The processor requirements are the same 
as for the 2.5D implementation. The number of Strassen 
steps is tuned for each number of processors, and the optimal 
number varies from to 3. We also tested the Strassen- 
2.5D algorithm, but its performance was always lower than 
2.5D-Strassen in our experiments. 



t is the machine precision. Strassen and other fast algorithms 
do not satisfy component-wise bounds but do satisfy the 
slightly weaker norm-wise bounds: |[C— (7|| < /(n)e||A||||B||, 
where = maxij Aij and / is polynomial in n [21] . 

Accuracy can be improved with the use of diagonal scal- 
ing matrices: D1CD3 = D1AD2 ■ D^ 1 BD3. It is possible 
to choose Di, D2, D3 so that the error bounds satisfy ei- 
ther ICy-Cy < f(n)e\\A(i, :)|| \\B(:, 3) || or \\C - C\\ < 
/(n)e|||A| • By scaling, the error bounds on Strassen 

become comparable to those of many other dense linear al- 
gebra algorithms, such as LU and QR decomposition [15] . 
Thus using Strassen for the matrix multiplications in a larger 
computation will often not harm the stability at all. 



6. CONCLUSIONS/FUTURE WORK 
6.1 Stability of fast matrix multiplication 

CAPS has the same stability properties as sequential ver- 
sions of Strassen. For a complete discussion of the stabil- 
ity of fast matrix multiplication algorithms, see 21 16 . 
We highlight a few main points here. The tightest error 
bounds for classical matrix multiplication are component- 
wise: \C— C\ < ne\ A\- 1 B\, where C is the computed result and 



6.2 Hardware scaling 

Although Strassen performs asymptotically less computa- 
tion and communication than classical matrix multiplication, 
it is more demanding on the hardware. That is, if one 
wants to run matrix multiplication near the peak CPU speed, 
Strassen is more demanding of the memory size and communi- 
cation bandwidth. This is because the ratio of computational 
cost to bandwidth cost is lower for Strassen than for classi- 
cal. From the lower bounds in Section [2.4] the asymptotic 



ratio of computational cost to bandwidth cost is M" 0,/2_1 
for Strassen-based algorithms, versus M 1 ' 2 for classical al- 
gorithms. This means that it is harder to run Strassen near 
peak than it is to run classical matrix multiplication near 
peak. In terms of the machine parameters /? and 7 introduced 
in Section |2.1 1 the condition to be able to be computation- 
bound is yM / 2 > c/3 for classical matrix multiplication and 
7M ^ /2-i > for Strassen Here c and c ' 

are constants 

that depend on the constants in the communication and 
computational costs of classical and Strassen-based matrix 
multiplication. 

The above inequalities may guide hardware design as long 
as classical and Strassen matrix multiplication are considered 
important computations. They apply both to the distributed 
case, where M is the local memory size and /3 is the inverse 
network bandwidth, and to the sequential/shared- memory 
case where M is the cache size and /3 is the inverse memory- 
cache bandwidth. 

6.3 Optimizing on-node performance 

Note that although our implementation performs above 
the classical peak performance, it performs well below the 
corresponding Strassen- Winograd peak, defined by the time 
it takes to perform c s n"°/P flops at the peak speed of each 
processor. To some extent, this is because Strassen is more 
demanding on the hardware, as noted above. However we 
have not yet analyzed whether the amount our performance 
is below Strassen peak can be entirely accounted for based 
on machine parameters. It is also possible that a high per- 
formance shared-memory Strassen implementation might 
provide substantial speedups for our implementation, as well 
as for 2D-Strassen and 2.5D-Strassen. 

6.4 Testing on various architectures 

We have implemented and benchmarked CAPS on only 
one architecture, a Cray XT4. It remains to check that 
it outperforms other matrix multiplication algorithms on a 
variety of architectures. On some architectures it may be 
more important to consider the topology of the network and 
redesign the algorithm to minimize contention, which we 
have not done. 

6.5 Improvements to the algorithm 

To be practically useful, it is important to generalize the 
number of processors on which CAPS can run. To attain 
the communication lower bounds, CAPS as presented in 
Section [3] must run on P a power of seven processors. Of 
course, CAPS can then be run on any number of processors 
by simply ignoring no more than | of them and incurring 
a constant factor overhead. Thus we can run on arbitrary 
P and attain the communication and computation lower 
bounds up to a constant factor. However the computation 
time is still dominant in most cases, and it is preferable 
to attain the computation lower bound exactly. It is an 
open question whether any algorithm can run on arbitrary 
P, attain the computation lower bound exactly, and attain 
the communication lower bound up to a constant factor. 

Moreover, although the communication costs of this al- 
gorithm match the lower bound up to a constant factor in 
bandwidth, and up to a logP factor in latency, it is an 
open question to determine the optimal constant in the lower 
bound and perhaps provide a new algorithm that matches it 



exactly. Note that in the analysis of CAPS in Section [3] the 
constants could be slightly improved. 

6.6 Parallelizing other algorithms 

6.6. 1 Another optimal classical algorithm 

We can apply our parallelization approach to recursive 
classical matrix multiplication to obtain a communication- 
optimal algorithm. This algorithm has the same asymptotic 
communication costs as the 2.5D algorithm [30]. We observed 
comparable performance to the 2.5D algorithm on our ex- 
perimental platform. As with CAPS, this algorithm has not 
been optimized for contention, whereas the 2.5D algorithm 
is very well optimized for contention on torus networks. 

6.6.2 Other fast matrix multiplication algorithms 

Our approach of executing a recursive algorithm in parallel 
by traversing the recursion tree in DFS (sequential) or BFS 
(parallel) manners is not limited to Strassen's algorithm. 
All fast square matrix multiplication algorithms are built 
out of ways to multiply no x no matrices using q < nf, 
multiplications. Like with Strassen and Strassen- Winograd, 
they compute q linear combinations of entries of each of A 
and B, multiply these pairwise, then compute the entries 
of C as linear combinations of theseQ CAPS can be easily 
generalized to any such multiplication, with the following 
modifications: 

• The number of processors P is a power of q. 

• The data layout must be such that all n§ blocks of A, B, 
and C are distributed equally among the P processors 
with the same layout. 

• The BFS and DFS determine whether the q multipli- 
cations are performed in parallel or sequentially. 

The communication costs are then exactly as above, but with 
= log„ q. 

It is unclear whether any of the faster matrix multiplica- 
tion algorithms are useful in practice. One reason is that 
the fastest algorithms are not explicit. Rather, there are 
non-constructive proofs that the algorithms exist. To imple- 
ment these algorithms, they would have to be found, which 
appears to be a difficult problem. The generalization of 
CAPS described in this section does apply to all of them, so 
we have proved the existence of a communication-avoiding 
non-explicit parallel algorithm corresponding to every fast 
matrix multiplication algorithm. We conjecture that the al- 
gorithms are all communication-optimal, but that is not yet 
proved since the lower bound proofs in 4 , 3 may not apply 
to all fast matrix multiplication algorithms. In cases where 
the lower bounds do apply, they match the performance of 
the generalization of CAPS, and so they are communication- 
optimal. 



'By 
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all fast matrix multiplication algorithms can be 



expressed in this bilinear form. 
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APPENDIX 

A. STRASSEN- WINOGRAD ALGORITHM 

The Strassen- Winograd algorithm is usually preferred to 
Strassen's algorithm in practice since it requires fewer addi- 
tions. We use it for our implementation of CAPS. Divide the 
input matrices A, B and output matrix C into 4 submatrices: 



An A i2 
-421 A22 



B - 



Bn B 12 
B21 B22 



C : 



C11 C12 
C21 C22 



Then form 7 linear combinations of the submatrices of each 
of A and B, call these Ti and Si, respectively; multiply 
them pairwise; then form the submatrices of C as linear 
combinations of these products: 



To 


= An 




S 


= Bn 




Qo 


= T 


So 




= Qo + Qs 


Ti 


= A12 




Si 


— B21 




Qi 


= Ti 


Si 


u 2 


= Ui+Qi 


T 2 


= A 2 i 


+ A 22 


S2 


= B\2 




Q2 


= T 2 


S 2 


Us 


= Ut+Q 2 


T 3 


= T 2 - 


- A12 


S3 


= B22 


-s 2 


Qs 


= T 3 


S3 


Cn 


= Qo + Qi 


Ti 


= An 


- A 2 i 


S4 


— B22 


— B\2 


Qi 


= Ti 


Si 


C12 


= U 3 + Q 5 


T 5 


= Aia 


+ T 3 


s 5 


— B22 




Qr, 


= n 


s 5 


C21 


= U 2 -Q 6 


T; 


= A 22 




S B 


= S3 - 


- B 2 i 


Qe 


= T 6 


s 6 


C22 


= U 2 + Q 2 



This is one step of Strassen- Winograd. The algorithm 
is recursive since it can be used for each of the 7 smaller 
matrix multiplications. In practice, one often uses only a 
few steps of Strassen- Winograd, although to attain O{n uo ) 
computational cost, it is necessary to recursively apply it all 
the way down to matrices of size O(l) X O(l). The precise 
computational cost of Strassen- Winograd is 



F(n) 



c s n 



5n 



Here c s is a constant depending on the cutoff point at which 
one switches to the classical algorithm. For a cutoff size of 
no, the constant is c s = (2no + 4)/n J0_ which is minimized 
at no = 8 yielding a computational cost of approximately 
3.7371"° -5n 2 . If using Strassen- Winograd with cutoff no this 
should be substituted into the computational cost expressions 
of Section [3] 



and Strassen-based bounds, it is necessary to consider three 
cases. 

Case 1. M = Q{n 2 /P) and M = 0(n 2 /P 2/ " ). The first 
condition is necessary for there to be enough memory to 
hold the input and output matrices; the second condition 
puts both classical and Strassen-based algorithms in the 
memory-dependent case. Then the ratio of the bandwidth 
costs is: 



R = e 



n 

PM 1 / 2 



n " 
PM^o/2- 



e 



(3-w )/2 



Using the two bounds that define this case, we obtain R = 

0(p(3"" 0)/2) and R = Q(p3/ W0 -1)_ 

Case 2. M = n(n 2 /P 2/ "°) and M = 0(n 2 /P 2/3 ). This 
means that in the classical case the memory-dependent lower 
bound dominates, but in the Strassen-based case the memory- 
independent lower bound dominates. Then the ratio is: 



r = e 



PM 1 / 2 / P 2 /"o 



= e 



1/2 



-,2/u 



Using the two bounds that define this case, we obtain R = 

O (p3/u -l) and R = Q(p2/ wo -2/3)_ 

Case 3. M = 0(P 2/3 ). This means that both the classi- 
cal and Strassen-based lower bounds are dominated by the 
memory-independent cases. Then the ratio is: 



r = e 



p2/3 / p2/uj 



e 



Overall, depending on the ratio of the problem size to the 
available memory, the factor by which the classical bandwidth 
costs exceed the Strassen-based bandwidth costs is between 

6 (p2/ W0 -2/3) and 6( -p(3-u, )/2^ 



B. COMMUNICATION-COST RATIOS 

In this section we derive the ratio R of classical to Strassen- 
based bandwidth cost lower bounds that appear in the begin- 
ning of Section[4] Note that both classical and Strassen-based 
lower bounds are attained by optimal algorithms. Similar 
derivations apply to the other ratios quoted in that section. 
Because the bandwidth cost lower bounds are different in 
the memory-dependent and the memory-independent cases, 
and the threshold between these is different for the classical 



